The goal of this session is to give you a taste of the different features of the DataScience Platform, including conducting analyses, publishing reports, scheduling scripts, and deploying models.
Imagine that you are a data scientist at a company that has to perform dynamic inventory management. An example of that would be a ride-sharing company where you want to know which parts of a city to direct your drivers to depending on the time of day and other factors like the weather.
Here we’ll perform some analysis in Jupyter and then publish these findings as a Report that a business user will find easy to consume.
The data is in data/processed_uber_nyc.RData and contains two dataframes:
agg_data
zone_polys
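As a minimal sketch, the file could be loaded in R as shown here. It assumes the working directory is the project root so that the relative path resolves. `load_objects` is a hypothetical helper written for this example, not part of the platform; it wraps base R's `load()`, which restores the saved objects and returns their names.

```r
# Hypothetical helper (not from the original session): load an .RData file
# into `envir` and return the names of the objects it restored.
load_objects <- function(path, envir = globalenv()) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(character(0))
  }
  # load() returns a character vector with the names of the restored objects.
  load(path, envir = envir)
}

# Path as given in the text; assumes the working directory is the project root.
loaded <- load_objects("data/processed_uber_nyc.RData")
print(loaded)
```

If the file is found, `loaded` should list the two dataframes named above, and `str(agg_data)` or `head(zone_polys)` can then be used to inspect them.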
The source of the data for this exercise is the Uber Pickups in New York City dataset by FiveThirtyEight. Data for 20 million pickups are aggregated by hour, date, and taxi zone (i.e., an approximate neighborhood) and enriched with calendar and weather data. More detailed information about each dataframe is below.
The agg_data dataframe contains the number of pickups for each taxi zone, date, and hour.
Fields:
locationID: unique ID for each taxi zone
date
hour: 24H format
borough: Borough that the zone is located in (e.g. Manhattan, Brooklyn, Queens)
zone: Name of the taxi zone (e.g. Times Sq, Chinatown, Central Harlem)
pickups: Number of pickups
day: Day of week (e.g. Mon, Tue, Wed)
is_holiday: Whether that day was a holiday (Boolean)
mean_temp_F: Mean temperature that day in Fahrenheit
## Source: local data frame [3 x 9]
## Groups: locationID, date, hour, borough [3]
##
##   locationID       date  hour borough           zone pickups   day
##        <int>      <chr> <chr>  <fctr>         <fctr>   <int> <chr>
## 1          1 2014-04-01    03     EWR Newark Airport       2   Tue
## 2          1 2014-04-01    04     EWR Newark Airport       4   Tue
## 3          1 2014-04-01    05     EWR Newark Airport       4   Tue
## # ... with 2 more variables: is_holiday <lgl>, mean_temp_F <int>
The zone_polys dataframe is a dimension table that describes the boundaries of each taxi zone.
Fields:
long: Longitude
lat: Latitude
order: Rank of point when drawing boundary
hole: Whether to plot a hole in that location (Boolean)
piece: The piece of the zone that the point is associated with
id: ID of zone. Same as locationID in agg_data
group: Group that the point belongs to
##        long      lat order  hole piece id group
## 1 -74.18445 40.69500     1 FALSE     1  1   0.1
## 2 -74.18449 40.69509     2 FALSE     1  1   0.1
## 3 -74.18450 40.69518     3 FALSE     1  1   0.1
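Because id in zone_polys matches locationID in agg_data, per-zone totals can be attached to the boundary points before mapping. The sketch below uses small toy stand-ins for both dataframes (all values are made up, and only a subset of the columns is included), so it illustrates the join rather than reproducing the original analysis.

```r
# Toy stand-ins for the two dataframes; every value below is made up.
agg_data <- data.frame(
  locationID = c(1L, 1L, 2L, 2L, 2L),
  hour       = c("03", "04", "03", "04", "05"),
  pickups    = c(2L, 4L, 10L, 7L, 3L)
)
zone_polys <- data.frame(
  long  = c(-74.18, -74.19, -74.17, -73.99, -73.98, -73.97),
  lat   = c(40.69, 40.70, 40.71, 40.75, 40.76, 40.74),
  order = c(1L, 2L, 3L, 1L, 2L, 3L),
  id    = c(1L, 1L, 1L, 2L, 2L, 2L),
  group = c("1.1", "1.1", "1.1", "2.1", "2.1", "2.1")
)

# Total pickups per taxi zone.
zone_totals <- aggregate(pickups ~ locationID, data = agg_data, FUN = sum)

# Attach the totals to every boundary point (id matches locationID).
map_data <- merge(zone_polys, zone_totals, by.x = "id", by.y = "locationID")

# merge() does not guarantee row order, so restore the drawing order
# of the boundary points within each group.
map_data <- map_data[order(map_data$group, map_data$order), ]
print(map_data)
```

The result has one row per boundary point with a pickups value attached, which is the shape that polygon-plotting functions such as ggplot2's `geom_polygon` typically expect when filling each zone by demand.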
Insights:
Lower Manhattan experiences the highest demand
Demand is also high at the airports (JFK and LaGuardia)
Not much activity in the outer boroughs
It looks like many of the neighborhoods show similar pickup patterns. By clustering the neighborhoods, we will likely improve both the predictive and the computational performance of the model.
To cluster the neighborhoods, we will perform k-means clustering on the hourly pickup patterns. We will use the elbow method to pick the most suitable number of clusters.
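The elbow method can be sketched with base R's `kmeans`. The example below runs on synthetic hourly profiles (30 made-up zones drawn from three made-up demand shapes), not on the Uber data, so the location of its elbow says nothing about the real dataset. The matrix layout, one row per zone and one column per hour, is an assumption about how the hourly patterns would be arranged.

```r
set.seed(42)  # k-means uses random starts; fix the seed for reproducibility

# Synthetic stand-in for the real data: 30 zones x 24 hours of pickups,
# generated from three made-up hourly shapes plus noise.
hours <- 0:23
shapes <- rbind(
  50 * exp(-(hours - 8)^2 / 8),    # morning peak
  50 * exp(-(hours - 18)^2 / 8),   # evening peak
  rep(10, 24)                      # flat, low demand
)
profiles <- shapes[rep(1:3, each = 10), ] +
  matrix(rnorm(30 * 24, sd = 2), nrow = 30)

# Total within-cluster sum of squares (WSS) for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(profiles, centers = k, nstart = 25)$tot.withinss
})
print(data.frame(k = ks, wss = round(wss)))

# The "elbow" is the k after which adding clusters reduces WSS only slightly.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve should flatten after k = 3, since three shapes were used to generate it. With real pickup profiles the elbow is usually less sharp and picking k involves some judgment.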
Four appears to be the most appropriate number of neighborhood clusters. Let’s visualize the clusters.
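Once the number of clusters is chosen, the final fit assigns each zone a cluster label that can be joined back to the zone data for plotting. A minimal sketch, again on synthetic profiles with hypothetical zone names (`zone_01`, `zone_02`, ...) rather than the real taxi zones:

```r
set.seed(1)
hours <- 0:23

# Four made-up hourly demand shapes (synthetic, not the Uber data).
shapes <- rbind(
  40 * exp(-(hours - 8)^2 / 8),    # morning peak
  40 * exp(-(hours - 18)^2 / 8),   # evening peak
  40 * exp(-(hours - 23)^2 / 8),   # late-night peak
  rep(5, 24)                       # flat, low demand
)
true_shape <- rep(1:4, each = 5)
profiles <- shapes[true_shape, ] +
  matrix(rnorm(20 * 24, sd = 1), nrow = 20)
rownames(profiles) <- sprintf("zone_%02d", 1:20)  # hypothetical zone names

# Final fit with the chosen number of clusters.
fit <- kmeans(profiles, centers = 4, nstart = 50)

# One row per zone with its cluster label, ready to join onto zone data.
zone_clusters <- data.frame(
  zone    = rownames(profiles),
  cluster = factor(unname(fit$cluster))
)
print(table(zone_clusters$cluster))
```

The cluster numbers themselves are arbitrary and can change between runs; only the grouping of zones is meaningful.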